Explore Red Wine Quality by Nan Li

Introduction

In this project, we use R to apply exploratory data analysis (EDA) techniques to a dataset of red wine quality ratings and physicochemical properties. The objective is to find which chemical properties influence the quality of red wines. We also produce refined plots to illustrate interesting relationships in the data. Background information on the data is available at this link, and a description of the variables is here.

Descriptive statistics

We start with the structure of the data. The dataset contains 13 variables and 1599 observations. Descriptive statistics for each variable follow as a first look.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
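
As a minimal sketch of how output like the above might be produced: the CSV file name below is an assumption (the report does not give it), so the load call is left as a comment, and a small stand-in data frame is built from the first ten rows printed above.

```r
# Hypothetical load step; the file name is an assumption, not taken from the report.
# wine <- read.csv("wineQualityReds.csv")

# Stand-in built from the first ten rows shown in the str() output above.
wine_head <- data.frame(
  X                = 1:10,
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

names(wine_head)    # column names
str(wine_head)      # column types and leading values
summary(wine_head)  # min, quartiles, mean, and max per column
```

On the full dataset the same three calls would be applied to the complete data frame rather than to this ten-row stand-in.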

X and quality are integer-valued; X appears to be just a row index (it runs from 1 to 1599). All other variables are continuous numerical quantities. From the variable names and descriptions, fixed.acidity and volatile.acidity are likely to be correlated with each other, as are free.sulfur.dioxide and total.sulfur.dioxide.

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Since we are primarily interested in the quality variable, its basic statistics deserve a closer look. quality is an ordered, categorical, discrete variable. According to the literature, it was rated on a 0-10 scale by at least 3 wine experts. In this dataset the values range only from 3 to 8, with a mean of 5.6 and a median of 6.
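
The mean and median quoted here can be cross-checked against the per-score counts printed later in this report (in the table of the ordered quality factor). The sketch below rebuilds the quality column from those counts; the variable names are illustrative, not the author's.

```r
# Counts per quality score, copied from the table() output later in this report.
quality_counts <- c(`3` = 10, `4` = 53, `5` = 681, `6` = 638, `7` = 199, `8` = 18)

# Rebuild the quality column. Row order is lost, which does not affect
# the mean, median, or range.
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

length(quality)  # number of wines
range(quality)   # lowest and highest score
mean(quality)    # average score
median(quality)  # middle score
```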

Univariate Plots Section

We draw quick histograms of the 12 variables (all except the index X) to see the shape of each distribution.

Histograms

Box-Plots

We also draw box plots of each variable as another view of the distributions.

Univariate Analysis

What is the structure of your dataset?

The dataset contains 13 variables and 1599 observations. Density and pH appear to be approximately normally distributed, with few outliers. Fixed and volatile acidity, the sulfur dioxides, sulphates, and alcohol appear to be long-tailed.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in this dataset is quality. It takes discrete values from 3 to 8, and its distribution looks roughly normal. A large majority of the wines examined received ratings of 5 or 6, and very few received 3, 4, or 8.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The variables are physicochemical attributes of red wine, so from basic chemistry we expect the concentration of one compound to be correlated with that of related compounds or compounds with similar components or structure. For example, there are three acidity-related attributes (fixed.acidity, volatile.acidity, and citric.acid), and since pH is a numeric scale that measures acidity, pH can be regarded as a summary of the wine's acidity.

Did you create any new variables from existing variables in the dataset?

For further investigation, we plan to create a new ordered factor variable for quality, as it will be more convenient to use in the bivariate and multivariate analyses.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The box plots show that all variables have outliers, mostly on the high side. Residual sugar and chlorides have extreme outliers. Citric acid has a large number of zero values. Alcohol has an irregularly shaped distribution but no pronounced outliers.

To see the shape of each distribution in more detail, we can adjust the bin width, choose a suitable scale, or exclude the outliers for a smoother visualization.

Box-plot Statistics

We use the box-plot statistics as the x-axis range, so that some of the outliers are excluded. For each variable, the five values printed below are the lower whisker, first quartile, median, third quartile, and upper whisker. Finer histograms of each variable are shown below. Because residual.sugar and chlorides are skewed with long tails, we also draw their histograms on a log scale for a smoother distribution.

## [1]  4.6  7.1  7.9  9.2 12.3
## [1] 0.12 0.39 0.52 0.64 1.01
## [1] 0.00 0.09 0.26 0.42 0.79
## [1] 0.90 1.90 2.20 2.60 3.65
## [1] 0.041 0.070 0.079 0.090 0.119
## [1]  1  7 14 21 42
## [1]   6  22  38  62 122
## [1] 0.33 0.55 0.62 0.73 0.99
## [1] 0.992350 0.995600 0.996750 0.997835 1.001000
## [1] 2.93 3.21 3.31 3.40 3.68
## [1]  8.4  9.5 10.2 11.1 13.5
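
A rough sketch of this trimming step, assuming the five-number output comes from `boxplot.stats()` (the report does not show the call). It uses only the first ten residual.sugar values from the str() output earlier in this report, and base R graphics rather than whatever plotting package the original used.

```r
# First ten residual.sugar values from the str() output earlier in this report.
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

bs <- boxplot.stats(sugar)
bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # points lying beyond the whiskers

# Use the whisker ends as the plotting range, which drops the outliers.
limits  <- bs$stats[c(1, 5)]
trimmed <- sugar[sugar >= limits[1] & sugar <= limits[2]]
hist(trimmed, main = "Residual sugar within whiskers", xlab = "residual.sugar")

# For a long right tail, a log10 scale spreads out the bulk of the values.
hist(log10(sugar), main = "Residual sugar (log10)", xlab = "log10(residual.sugar)")
```

In this ten-value sample the single value 6.1 falls outside the whiskers, so the trimmed histogram covers the remaining nine values.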

Acidity

Residual Sugar

Chlorides

Sulfur Dioxide

Density & pH

Sulphates

Alcohol


Bivariate Plots Section

To investigate the relationships between pairs of variables, we start by calculating the correlations between all variables in the dataset, then pick the pairs with stronger correlations for further analysis.

Below is another correlation plot. Each filled circle shows the strength of the correlation between two variables: a bigger, darker circle indicates a stronger correlation, while a smaller, lighter circle indicates a weaker one.

From the correlation charts, the pairs with stronger correlations include:

- fixed.acidity vs. citric.acid
- fixed.acidity vs. pH
- fixed.acidity vs. density
- volatile.acidity vs. citric.acid
- free.sulfur.dioxide vs. total.sulfur.dioxide
- chlorides vs. sulphates
- alcohol vs. density
- quality vs. alcohol
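
A ranking of pairs like the one above could be derived along these lines. This is a sketch, not the author's code: it runs on the first ten rows of four columns taken from the str() output earlier in this report, so the coefficients it prints will differ from those of the full dataset. `pair_table` is an illustrative name.

```r
# Stand-in data: first ten rows of four columns from the str() output.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Pearson correlation matrix for all columns.
cor_mat <- cor(wine_head)

# Flatten the upper triangle into one row per pair, strongest first.
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
pair_table <- pair_table[order(-abs(pair_table$r)), ]
pair_table
```

Sorting by absolute value keeps strong negative correlations (such as an acidity vs. pH pair) near the top alongside strong positive ones.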

We create a new ordered quality variable for later analysis. The original quality variable is an integer, whereas the new one is an ordered factor, so the dataset is divided into six groups labelled by quality.

##  Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
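
One way such an ordered factor might be built is sketched below on the first ten quality scores from the str() output earlier in this report. Fixing the levels to 3 through 8 is an assumption made here so that the integer codes match the ones printed above; the author's exact call is not shown.

```r
# First ten quality scores from the str() output earlier in this report.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Fix the levels to the observed range 3..8 so the ordering is explicit
# even when a subset is missing some scores.
quality_factor <- factor(quality, levels = 3:8, ordered = TRUE)

str(quality_factor)                    # ordered factor with 6 levels
table(quality_factor)                  # count of wines per quality level
quality_factor[1] < quality_factor[4]  # ordered factors support comparisons
```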

Acidity

Acidity vs. pH

Free SO2 vs. Total SO2

Alcohol vs. Density

This plot shows a weak negative correlation between volatile.acidity and alcohol; the correlation coefficient is -0.202288.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$volatile.acidity and wine$alcohol
## t = -8.2546, df = 1597, p-value = 3.155e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2488416 -0.1548020
## sample estimates:
##       cor 
## -0.202288
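
The output above has the form printed by `cor.test()`. As a small illustration of the call and of how to pull out individual results, the sketch below uses only the first ten rows from the str() output earlier in this report, so its numbers will not match the full-data result.

```r
# First ten rows of the two columns, from the str() output.
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
alcohol          <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

res <- cor.test(volatile_acidity, alcohol, method = "pearson")

res$estimate   # sample correlation coefficient r
res$parameter  # degrees of freedom, n - 2
res$conf.int   # 95 percent confidence interval for r
res$p.value    # two-sided p-value for the null hypothesis r = 0
```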

In another density/alcohol plot, the purple line shows the median density of wines with the same alcohol level. The first rows of the underlying summary are shown below.

## # A tibble: 6 × 4
##   alcohol density_mean density_median     n
##     <dbl>        <dbl>          <dbl> <int>
## 1    8.40    1.0001000        1.00010     2
## 2    8.50    0.9991400        0.99914     1
## 3    8.70    0.9977500        0.99775     2
## 4    8.80    1.0024200        1.00242     2
## 5    9.00    0.9984173        0.99780    30
## 6    9.05    0.9958500        0.99585     1
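
The tibble output suggests a grouped summary, probably built with dplyr. The sketch below reproduces the same shape of table in base R, which is a substitution made here to avoid package dependencies. It uses the first five rows of alcohol and density from the str() output earlier in this report, with density as rounded there, so the values are coarser than those in the table above.

```r
# First five rows of alcohol and density from the str() output
# (density is rounded to three decimals in that printout).
sample_rows <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4),
  density = c(0.998, 0.997, 0.997, 0.998, 0.998)
)

# One summary row per distinct alcohol level.
groups <- split(sample_rows, sample_rows$alcohol)
by_alcohol <- do.call(rbind, lapply(groups, function(g) {
  data.frame(
    alcohol        = g$alcohol[1],
    density_mean   = mean(g$density),
    density_median = median(g$density),
    n              = nrow(g)
  )
}))
by_alcohol
```

Plotting `density_median` against `alcohol` from a table like this gives the kind of median line described above.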

Sulphates vs. Alcohol

We also facet the data points by quality to compare the correlation at different quality levels.

Quality vs. Alcohol

## wine$quality_factor: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.580  11.000 
## -------------------------------------------------------- 
## wine$quality_factor: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## wine$quality_factor: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## wine$quality_factor: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## wine$quality_factor: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## wine$quality_factor: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

Quality vs. Volatile.acidity

## wine$quality_factor: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## wine$quality_factor: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## wine$quality_factor: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## wine$quality_factor: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## wine$quality_factor: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## wine$quality_factor: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500
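
The per-quality summaries in this section have the layout printed by `by()`. A minimal sketch of that kind of call, assuming nothing beyond base R and using the first ten rows from the str() output earlier in this report (which contain only quality levels 5, 6, and 7):

```r
# First ten rows of alcohol and quality from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality_factor <- factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5), ordered = TRUE)

# Six-number summary of alcohol within each quality level.
by(alcohol, quality_factor, summary)

# The same grouping reduced to a single statistic per level.
medians <- tapply(alcohol, quality_factor, median)
medians
```

`tapply()` is convenient when only one number per group is needed, for example to compare medians across quality levels.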

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Wine quality is correlated with alcohol level and volatile acidity. As volatile acidity decreases, wine quality increases: the median volatile acidity falls from 0.845 at quality 3 to 0.37 at qualities 7 and 8. Wine quality also increases with alcohol level, but the trend does not hold at the low end: the median alcohol is nearly the same for qualities 3 and 4 (9.925 and 10.00).

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There are strong correlations among the acidity variables, between pH and acidity, and between free.sulfur.dioxide and total.sulfur.dioxide.

What was the strongest relationship you found?

The strongest positive correlation found is between fixed.acidity and citric.acid, with a correlation coefficient of 0.6717.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$fixed.acidity and wine$citric.acid
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6438839 0.6977493
## sample estimates:
##       cor 
## 0.6717034
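
As a consistency check on the test output, the t statistic of a Pearson test follows from the correlation r and the degrees of freedom df as t = r * sqrt(df / (1 - r^2)). The helper name `t_from_r` below is illustrative; the inputs are the values printed in this report.

```r
# t statistic for a Pearson correlation r with df = n - 2 degrees of freedom.
t_from_r <- function(r, df) r * sqrt(df / (1 - r^2))

t_from_r(0.6717034, 1597)  # fixed.acidity vs. citric.acid
t_from_r(-0.202288, 1597)  # volatile.acidity vs. alcohol
```

Both results agree with the reported t values (36.234 and -8.2546) to the printed precision, and they show why even the weak correlation of about -0.2 is highly significant with 1599 observations.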

Multivariate Plots Section

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

As examined in the bivariate section, quality is most strongly correlated with volatile.acidity and alcohol. In this section, we draw a scatter plot of volatile.acidity vs. alcohol and color the points by quality factor. The two predictors are only weakly correlated with each other: the correlation coefficient is -0.202288.


Final Plots and Summary

Plot One

Description One

Because alcohol content is an important factor for wine quality, we choose its histogram as one of the three plots. The distribution of alcohol level is right-skewed (the mean, 10.42, exceeds the median, 10.20, and the tail extends to 14.9), and the most frequent alcohol level is 9.5.

Plot Two

Description Two

This boxplot demonstrates the relationship between alcohol content and wine quality. Generally, higher alcohol content is associated with higher wine quality. However, the median alcohol levels of the lower-quality wines (3 and 4) are almost the same.

Plot Three

Description Three

As the correlation tests show, wine quality is most strongly associated with alcohol and volatile acidity. We can conclude that better wines tend to have relatively higher alcohol content and lower volatile acidity.


Reflection

Through this exploratory data analysis, we can reach the following conclusions:

- The most frequent quality levels of red wine are 5 and 6.
- As alcohol percentage decreases, density increases.
- As fixed acidity increases, density increases as well.
- The acidity variables are strongly correlated with each other.

From this investigation, I conclude that the key factors related to wine quality are alcohol content and volatile acidity.

For future exploration of this data, I would group the wines into three quality buckets (for example, levels 3-4, 5-6, and 7-8) and look at the patterns that appear within each bucket.